Evaluation DGE algorithms

All the results were filtered by the adj.p-value < 0.05 and logFC >= 0.3

## An object of class Seurat 
## 48361 features across 12617 samples within 1 assay 
## Active assay: RNA (48361 features, 0 variable features)

Single cell methods:

For single cell methods I am going to consider both Seurat and Scanpy

Scanpy

here

For Scanpy I take in account the methods: - t-test_overestim_var - wilcoxon

Unfortunately the logistic regression provided by scanpy does not give LogFoldChanges and will not be taken in account.

## INFO [2023-01-13 11:45:27] $x
## INFO [2023-01-13 11:45:27] list(scanpy_t$Symbol, scanpy_wx$Symbol)
## INFO [2023-01-13 11:45:27] 
## INFO [2023-01-13 11:45:27] $category.names
## INFO [2023-01-13 11:45:27] c("scanpy_t", "scanpy_wx")
## INFO [2023-01-13 11:45:27] 
## INFO [2023-01-13 11:45:27] $filename
## INFO [2023-01-13 11:45:27] NULL
## INFO [2023-01-13 11:45:27] 
## INFO [2023-01-13 11:45:27] $disable.logging
## INFO [2023-01-13 11:45:27] T
## INFO [2023-01-13 11:45:27]

The t-test_overestim_var recognized as significative almost all the genes and include almost all the genes from the wilcoxon test. Nolw let’s evaluate the expression of the genes.

Scanpy T-test

Top upregulated genes in Exhausted NK cells:

Top downregulated genes in Exhausted NK cells:

Genes with lower abs(logFC)

Apparently Scanpy has a bias evaluating the logFC of genes that are completely missing in one of the conditions. But The genes are correctly plotted in the report. So I will check the expression by the z-score instead of the LogFC:

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

Better result for the extremes logFC (scores), but still not clear results for the low scored genes. Indeed those genes are expressed by a small proportion of cells:

##        Symbol  pvals_adj logfoldchanges   abs_lfc    scores NK.exhausted
## 43   Z83847.1 0.04939069     -21.810220 21.810220 -3.404695  0.000000000
## 44 AL603910.1 0.04926814     -21.742660 21.742660 -3.406065  0.000000000
## 45    RALGPS1 0.04926814      -2.078307  2.078307 -3.407628  0.001892148
##    NK.resident
## 43 0.002083333
## 44 0.002083333
## 45 0.008333333
## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

GO terms can be linked to immune cell activity, down terms (up for resident NK cells) are correlated with translation regulation.

Scanpy wilcoxon

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

Better differentially expressed genes. GO looks quite similar.

Seurat

here

For Seurat I take in account the methods: - t-test - wilcoxon - logistic regression

Similar numbers of DE genes were found by Seurat’s different flavours.

Seurat T test

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

Better result with seurat, the differences of expression are small (same cell types, just different state). In average the different gene expression is better appreciable

Enriched GO terms from upregulated genes are specific of immune activation. Downregulated GO terms are inherent to response to various factor, could it be a hint of the missing response to the cancer cell in exhausted NK cell?

Seurat wilcoxon

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

Similar to the previous

Seurat logreg

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

As above.

Pseudobulk

In the pseudobulks, counts were summed for each celltype and sample, in practice obtaining a column of counts for each celltype and sample. Then, genes that weren’t showing counts in more than 90% of the cells were removed. Also, genes showing expression below 10 counts in all the pseudobulks were removed (pseudobulks are the sum of cells counts. showing less than 10 counts indicate almost all the cells didn’t had count’s for that gene).

## [1] 8390   67

DESeq2

For DESeq2, firstly I run it with the “LRT” methods that showed better results here. The PCA plot and clustering clustering are mixed, probably because the division based on cell type here. I run DESeq2 specifing the model to take account of the batches.

Then, I tested combat_seq to correct the batch effect here. Also, I decided to see if there was a batch effect afrom unknown sources using the single variable analysis (sva) here.

## INFO [2023-01-13 11:45:31] $x
## INFO [2023-01-13 11:45:31] list(deseq2$Symbol, deseq2_combat$Symbol, deseq2_sva$Symbol)
## INFO [2023-01-13 11:45:31] 
## INFO [2023-01-13 11:45:31] $category.names
## INFO [2023-01-13 11:45:31] c("deseq2", "deseq2_combat", "deseq2_sva")
## INFO [2023-01-13 11:45:31] 
## INFO [2023-01-13 11:45:31] $filename
## INFO [2023-01-13 11:45:31] NULL
## INFO [2023-01-13 11:45:31] 
## INFO [2023-01-13 11:45:31] $disable.logging
## INFO [2023-01-13 11:45:31] T
## INFO [2023-01-13 11:45:31]

DESeq2 without batch correction found 147 DE genes, sva that corrected by batch effect without prion knowledge found alomt 1000 genes Combat correction found only 49 genes.

DESeq2-LRT

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Downregulated GO terms"

Differencies in the expression aren’t clear. No up-regulated pathways where found. Downregulated are similar to Seurat

DESeq2-combat

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

Result from combat are more clear in DGE. The Go terms looks similar to previous analysis

DESeq2-sva

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

Similarly to DESeq2, the genes differencies in expression aren’t clear. GO terms similar to seurat/deseq2

limma

Normal and with combat

limma limma combat

## INFO [2023-01-13 11:45:31] $x
## INFO [2023-01-13 11:45:31] list(limma$Symbol, limma_combat$Symbol)
## INFO [2023-01-13 11:45:31] 
## INFO [2023-01-13 11:45:31] $category.names
## INFO [2023-01-13 11:45:31] c("limma", "limma_combat")
## INFO [2023-01-13 11:45:31] 
## INFO [2023-01-13 11:45:31] $filename
## INFO [2023-01-13 11:45:31] NULL
## INFO [2023-01-13 11:45:31] 
## INFO [2023-01-13 11:45:31] $disable.logging
## INFO [2023-01-13 11:45:31] T
## INFO [2023-01-13 11:45:31]

Limma found more than 600 genes, no differences with the batch corrected data

Limma-voom

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

high LogFC aren’t well apreciable

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

Down-regulated GO terms more correlated to translation regulation

EdgeR

I tested edgeR with LRT as best option from here, I tested combat too. edgeR edgeR_combat

## INFO [2023-01-13 11:43:34] $x
## INFO [2023-01-13 11:43:34] list(edger$Symbol, edger_combat$Symbol)
## INFO [2023-01-13 11:43:34] 
## INFO [2023-01-13 11:43:34] $category.names
## INFO [2023-01-13 11:43:34] c("edgeR", "edgeR_combat")
## INFO [2023-01-13 11:43:34] 
## INFO [2023-01-13 11:43:34] $filename
## INFO [2023-01-13 11:43:34] NULL
## INFO [2023-01-13 11:43:34] 
## INFO [2023-01-13 11:43:34] $disable.logging
## INFO [2023-01-13 11:43:34] T
## INFO [2023-01-13 11:43:34]

aroung 500 genes found by edgeR

EdgeR

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

High and low LogFC aren’t clear. Low abs(logFC) are better appreciable. BP terms contains both response related patways and translation regulation pathways

EdgeR_combat

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

DE genes similar to EdgeR. BP translation-related

Wilcoxon

Tested a non-parametric test with a handwritten wilcoxon test (inspired from this)

wilcoxon

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

Differentially expressed genes not clear, BP (only up-regulated) similar to above.